KMID : 1137820220430020109
|
|
Journal of Biomedical Engineering Research (의공학회지) 2022 Volume.43 No. 2 p.109 ~ p.115
|
|
Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification
|
|
Yoo Sung-Lim
|
|
Abstract
|
|
|
Purpose: Medical records classification using vectorization techniques plays an important role in natural language processing. The purpose of this study was to identify suitable vectorization techniques for electronic medical records classification.
Material and methods: A total of 403 electronic medical documents were extracted retrospectively and classified using the cosine similarity calculated with Scikit-learn (a Python module for machine learning) in Jupyter Notebook. Vectors for the medical documents were produced by three different vectorization techniques (TF-IDF, latent semantic analysis (LSA) and Word2Vec), and the classification precision of each technique was evaluated. The Kruskal-Wallis test was used to determine whether there was a significant difference among the three vectorization techniques.
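As a rough illustration of the pipeline described above, the following minimal sketch builds TF-IDF and LSA vectors with Scikit-learn and scores them by cosine similarity. It is not the study's code: the toy documents, the labels, the `top1_precision` helper, the choice of three LSA components, and the nearest-neighbour definition of precision are all assumptions made for the example. Word2Vec is left out because it needs a separate library.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; the study's 403 records are not reproduced here.
docs = [
    "chest pain radiating to left arm with shortness of breath",
    "acute chest pain and dyspnea on exertion",
    "frequent urination, excessive thirst and high blood glucose",
    "elevated blood glucose with thirst and weight loss",
    "productive cough, fever and chest crackles",
    "fever with cough and crackles on lung auscultation",
]
labels = np.array(
    ["cardiac", "cardiac", "diabetes", "diabetes", "pneumonia", "pneumonia"]
)


def top1_precision(vectors, labels):
    """Fraction of documents whose most similar *other* document shares its label.

    This is one plausible way to score similarity-based classification;
    the abstract does not specify the exact procedure.
    """
    sim = cosine_similarity(vectors)   # dense (n_docs, n_docs) matrix
    np.fill_diagonal(sim, -np.inf)     # exclude each document's match with itself
    nearest = sim.argmax(axis=1)       # index of the most similar other document
    return float(np.mean(labels[nearest] == labels))


# TF-IDF: sparse term-weight vectors.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# LSA: truncated SVD applied to the TF-IDF matrix (3 components is an
# arbitrary choice for this tiny corpus).
lsa = TruncatedSVD(n_components=3, random_state=0).fit_transform(tfidf)

print("TF-IDF precision:", top1_precision(tfidf, labels))
print("LSA precision:   ", top1_precision(lsa, labels))
```

On a corpus this small the two scores say nothing about the relative merit of the techniques; the sketch only shows how the vectors and similarities fit together.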
Results: The 403 medical documents covered 41 different diseases, and the average number of documents per diagnosis was 9.83 (standard deviation=3.46). The classification precisions of the three vectorization techniques were 0.78 (TF-IDF), 0.87 (LSA) and 0.79 (Word2Vec). There was a statistically significant difference among the three vectorization techniques.
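For readers unfamiliar with the Kruskal-Wallis test, a minimal sketch using `scipy.stats.kruskal` is shown here. The three score lists are made-up values chosen only to show the call; they are not the study's data, and the abstract does not say which samples were actually compared or which significance level was used (0.05 is assumed).

```python
from scipy.stats import kruskal

# Made-up precision samples for illustration only; NOT the study's data.
tfidf_scores = [0.74, 0.78, 0.80, 0.77, 0.79]
lsa_scores = [0.85, 0.88, 0.86, 0.89, 0.87]
w2v_scores = [0.76, 0.81, 0.79, 0.78, 0.80]

# Rank-based test of whether the three groups come from the same distribution.
result = kruskal(tfidf_scores, lsa_scores, w2v_scores)
print(f"H = {result.statistic:.3f}, p = {result.pvalue:.4f}")

# Assumed significance level of 0.05.
significant = result.pvalue < 0.05
print("Significant difference among groups:", significant)
```

A significant result only indicates that at least one group differs; identifying which one would need a follow-up pairwise comparison.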
Conclusions: The results suggest that removing irrelevant information (LSA) is a more efficient vectorization technique than modifying the weights of vectorization models (TF-IDF, Word2Vec) for medical document classification.
|
|
KEYWORD
|
|
Natural language processing, Medical records classification, Vectorization techniques, Machine learning, Latent semantic analysis